1 Computer architecture: A unified processor architecture for RISC & VLIW DSP
Tay-Jyi Lin, Chie-Min Chao, Chia-Hsien Liu, Pi-Chen Hsiao, Shin-Kai Chen, Li-Chun Lin,
Chih-Wei Liu, Chein-Wei Jen

April 2005 Proceedings of the 15th ACM Great Lakes symposium on VLSI 
Publisher: ACM Press 

Full text available: pdf(445.55 KB) Additional Information: full citation, abstract, references, index terms

This paper presents a unified processor core with two operation modes. The processor 
core works as a compiler-friendly MIPS-like core in the RISC mode, and it is a 4-way
VLIW in its DSP mode, which has distributed and ping-pong register organization 
optimized for stream processing. To minimize hardware, the DSP mode has no control 
construct for program flow, while the data manipulation RISC instructions are executed in 
the DSP datapath. Moreover, the two operation modes can be changed ins ... 

Keywords: digital signal processor, dual-core processor, register organization, variable- 
length instruction encoding 



2 Instruction fetch mechanisms for VLIW architectures with compressed encodings
Thomas M. Conte, Sanjeev Banerjia, Sergei Y. Larin, Kishore N. Menezes, Sumedh W. 
Sathaye 

December 1996 Proceedings of the 29th annual ACM/IEEE international symposium 
on Microarchitecture 

Publisher: IEEE Computer Society 

Full text available: pdf(1.34 MB) Additional Information: full citation, abstract, references, citings, index terms

VLIW architectures use very wide instruction words in conjunction with high bandwidth to 
the instruction cache to achieve multiple instruction issue. This report uses the TINKER 
experimental testbed to examine instruction fetch and instruction cache mechanisms for 
VLIWs. A compressed instruction encoding for VLIWs is defined and a classification 
scheme for i-fetch hardware for such an encoding is introduced. Several interesting cache 
and i-fetch organizations are described and evaluated through ... 

Keywords: TINKER experimental testbed, VLIW architectures, compressed encodings, 
compressed instruction encoding, i-fetch hardware, instruction cache, instruction fetch 
mechanisms, instruction words, multiple instruction issue, parallel architectures, silo 
cache, trace-driven simulations 
Allon Adir, Yaron Arbetman, Bella Dubrov, Yossi Lichtenstein, Michal Rimon, Michael Vinov,
Massimo A. Calligaro, Andrew Cofler, Gabriel Duffy 

June 2005 Proceedings of the 42nd annual conference on Design automation 
Publisher: ACM Press 

Full text available: pdf(355.75 KB) Additional Information: full citation, abstract, references, index terms

Parallelism in processor architecture and design imposes a verification challenge as the 
exponential growth in the number of execution combinations becomes unwieldy. In this 
paper we report on the verification of a Very Large Instruction Word processor. The 
verification team used a sophisticated test program generator that modeled the parallel 
aspects as sequential constraints, and augmented the tool with manually written test 
templates. The system created large numbers of legal stimuli, however ... 

Keywords: VLIW, functional verification, parallelism, processor verification, test 
generation 



4 Session 10B: VLIW exploration and design synthesis: Power exploration for
embedded VLIW architectures 

Mariagiovanna Sami, Donatella Sciuto, Cristina Silvano, Vittorio Zaccaria 
November 2000 Proceedings of the 2000 IEEE/ACM international conference on
Computer-aided design ICCAD '00
Publisher: IEEE Press 

Full text available: pdf(280.04 KB) Additional Information: full citation, abstract, references, citings

In this paper, we propose a system-level power exploration methodology for embedded 
VLIW architectures based on an instruction-level analysis. The instruction-level energy 
model targets a general pipeline scalar processor; several architectural parameters such 
as number and type of pipeline stages as well as average stall/latency cycles per 
instruction and inter-instruction effects are taken into account. The application of the 
proposed model to VLIW processors results intractable from the point ... 

5 Low power issues: Compiler-directed thermal management for VLIW functional units
Madhu Mutyam, Feihui Li, Vijaykrishnan Narayanan, Mahmut Kandemir, Mary Jane Irwin
June 2006 Proceedings of the 2006 ACM SIGPLAN/SIGBED conference on Language,
compilers and tool support for embedded systems LCTES '06
Publisher: ACM Press 

Full text available: pdf(599.24 KB) Additional Information: full citation, abstract, references, index terms

As processors, memories, and other components of today's embedded systems are 
pushed to higher performance in more enclosed spaces, processor thermal management 
is quickly becoming a limiting design factor. While previous proposals mostly approached 
this thermal management problem from circuit and architecture angles, software can also 
play an important role in identifying and eliminating thermal hotspots as it is the main 
factor that shapes the order and frequency of accesses to differen ... 

Keywords: IPC, VLIW, thermal 



6 DSP: A resource-shared VLIW processor architecture for area-efficient on-chip
multiprocessing
Kazutoshi Kobayashi, Masao Aramoto, Yoichi Yuyama, Akihiko Higuchi, Hidetoshi Onodera
January 2005 Proceedings of the 2005 conference on Asia South Pacific design
automation ASP-DAC '05
Publisher: ACM Press 

Full text available: pdf(412.71 KB) Additional Information: full citation, abstract, references

We propose an area-efficient resource-shared VLIW processor (RSVP) for future leaky nm 
process technologies. It consists of several single-way independent processor units (IPUs) 
that share parallel processor resources. Each IPU works as a variable-way VLIW processor 
sharing the parallel resources according to priorities of given tasks. RSVP allocates shared 
parallel resources to the IPUs cycle by cycle. It can minimize the number of NOPs that
waste power. The performance per power (P 3 ... 

7 An investigation of static versus dynamic scheduling 
Carl E. Love, Harry F. Jordan 

May 1990 ACM SIGARCH Computer Architecture News, Proceedings of the 17th
annual international symposium on Computer Architecture ISCA '90, volume
18 Issue 3a
Publisher: ACM Press 

Full text available: pdf(1.17 MB) Additional Information: full citation, references, index terms



8 Low power: Branch prediction techniques for low-power VLIW processors
G. Palermo, M. Sami, C. Silvano, V. Zaccaria, R. Zafalon

April 2003 Proceedings of the 13th ACM Great Lakes symposium on VLSI 
Publisher: ACM Press 

Full text available: pdf(178.89 KB) Additional Information: full citation, abstract, references, citings, index terms

Main goal of the paper is to introduce a branch prediction scheme suitable for energy- 
efficient VLIW (Very Long Instruction Word) processors aiming at reducing the energy 
associated with the prediction phase by filtering the accesses to the branch predictor 
block. To analyze the effectiveness of the proposed low-power branch prediction scheme, 
we combined it to some well-known dynamic branch prediction techniques suitable for 
VLIW processors. Experimental results have been carried out on Lx, a 4 ... 

Keywords: VLIW processors, branch prediction, low-power design 



9 Memory hierarchies: A code decompression architecture for VLIW processors 
Yuan Xie, Wayne Wolf, Haris Lekatsas 

December 2001 Proceedings of the 34th annual ACM/IEEE international symposium 
on Microarchitecture 

Publisher: IEEE Computer Society 

Full text available: pdf, Publisher Site Additional Information: full citation, abstract, references, citings

In embedded system design, memory has been one of the most restricted resources. 
Reducing program size has been an important goal when designing an embedded system. 
Most of the previous work on code compression has targeted RISC architectures. Recently 
VLIW processors became very popular, particularly for signal processing. Decompression 
speed is especially important for VLIW architectures given that the length of the 
instruction word is long. Furthermore, modern VLIW architectures use flexible ... 

10 Processor and memory design: Distributed loop controller architecture for multi- 
threading in uni-threaded VLIW processors 

Praveen Raghavan, Andy Lambrechts, Murali Jayapala, Francky Catthoor, Diederik Verkest 
March 2006 Proceedings of the conference on Design, automation and test in Europe:
Proceedings DATE '06 

Publisher: European Design and Automation Association 

Full text available: pdf(241.43 KB) Additional Information: full citation, abstract, references

Reduced energy consumption is one of the most important design goals for embedded 
application domains like wireless, multimedia and biomedical. Instruction memory 
hierarchy has been proven to be one of the most power hungry parts of the system. This 
paper introduces an architectural enhancement for the instruction memory to reduce 
energy and improve performance. The proposed distributed instruction memory 
organization requires minimal hardware overhead and allows execution of multiple loops 
in p ... 
11 Memory system performance of programs with intensive heap allocation
Amer Diwan, David Tarditi, Eliot Moss 

August 1995 ACM Transactions on Computer Systems (TOCS), volume 13 issue 3 
Publisher: ACM Press 

Full text available: pdf(2.10 MB) Additional Information: full citation, abstract, references, citings, index terms

Heap allocation with copying garbage collection is a general storage management 
technique for programming languages. It is believed to have poor memory system 
performance. To investigate this, we conducted an in-depth study of the memory system 
performance of heap allocation for memory systems found on many machines. We studied 
the performance of mostly functional Standard ML programs which made heavy use of 
heap allocation. We found that most machines support heap allocation poorly. Howeve ... 

Keywords: automatic storage reclamation, copying garbage collection, garbage 
collection, generational garbage collection, heap allocation, page mode, subblock 
placement, write through, write-back, write-buffer, write-miss policy, write-policy 



12 Partitioned register files for VLIWs: a preliminary analysis of tradeoffs 
Andrea Capitanio, Nikil Dutt, Alexandru Nicolau 

December 1992 ACM SIGMICRO Newsletter, Proceedings of the 25th annual
international symposium on Microarchitecture MICRO 25, volume 23 issue 1-2

Publisher: IEEE Computer Society Press, ACM Press 

Full text available: pdf(1.09 MB) Additional Information: full citation, references, citings, index terms




13 Micro processor architecture: Automatic generation of application specific processors 
David Goodwin, Darin Petkov 

October 2003 Proceedings of the 2003 international conference on Compilers,
architecture and synthesis for embedded systems
Publisher: ACM Press

Full text available: pdf(231.13 KB) Additional Information: full citation, abstract, references, citings, index terms

An application-specific instruction-set processor (ASIP) is ideally suited for embedded 
applications that have demanding performance, size, and power requirements that cannot 
be satisfied by a general purpose processor. ASIPs also have time-to-market and 
programmability advantages when compared to custom ASICs. The AutoTIE system 
simplifies the creation of ASIPs by automatically enhancing a base processor with 
application specific instruction set architecture (ISA) extensions, including instruct ... 

Keywords: ASIPs, automatic instruction-set generation, configurable processors, 
extensible processors 



14 Instruction-level power estimation for embedded VLIW cores 

M. Sami, D. Sciuto, C. Silvano, V. Zaccaria
May 2000 Proceedings of the eighth international workshop on Hardware/software
codesign
Publisher: ACM Press 

Full text available: pdf(226.00 KB) Additional Information: full citation, abstract, references, citings, index terms

In this paper, a power estimation methodology operating at the instruction-level is 
proposed. The methodology is tightly related to the characteristics of the system 
architecture, mainly in terms of one or more target processors, the memory sub-system, 
the system-level buses and the coprocessors. In this system-level framework, our main 
goal is to define a power model for CPU cores at the instruction-level. First, the proposed 
power model deals with a general five-stage pipeline processor a ... 
15 Design space exploration for embedded systems: Energy estimation and optimization
of embedded VLIW processors based on instruction clustering
A. Bona, M. Sami, D. Sciuto, V. Zaccaria, C. Silvano, R. Zafalon

June 2002 Proceedings of the 39th conference on Design automation

Publisher: ACM Press

Full text available: pdf(308.00 KB) Additional Information: full citation, abstract, references, citings, index terms

Aim of this paper is to propose a methodology for the definition of an instruction-level 
energy estimation framework for VLIW (Very Long Instruction Word) processors. The 
power modeling methodology is the key issue to define an effective energy-aware 
software optimisation strategy for state-of-the-art ILP (Instruction Level Parallelism) 
processors. The methodology is based on an energy model for VLIW processors that 
exploits instruction clustering to achieve an efficient and fine grained energy ... 

Keywords: power estimation, vliw architectures 



16 Compiler-driven cached code compression schemes for embedded ILP processors
Sergei Y. Larin, Thomas M. Conte 

November 1999 Proceedings of the 32nd annual ACM/IEEE international symposium 
on Microarchitecture 

Publisher: IEEE Computer Society 

Full text available: pdf, Publisher Site Additional Information: full citation, abstract, references, citings, index terms

During the last 15 years, embedded systems have grown in complexity and performance 
to rival desktop systems. The architectures of these systems present unique challenges to 
processor microarchitecture, including instruction encoding and instruction fetch 
processes. This paper presents new techniques for reducing embedded system code size 
without reducing functionality. This approach is to extract the pipeline decoder logic for an 
embedded VLIW processor in software at system develo ... 

17 Parallel and distributed systems and networking: Performance evaluation for a
compressed-VLIW processor

Sunghyun Jee, Kannappan Palaniappan 

March 2002 Proceedings of the 2002 ACM symposium on Applied computing 

Publisher: ACM Press 

Full text available: pdf(498.01 KB) Additional Information: full citation, abstract, references, index terms

This paper presents a new ILP processor architecture called Compressed VLIW (CVLIW). 
The CVLIW processor constructs a sequence of long instructions by removing nearly all 
NOPs (No Operations) and LNOPs (Long NOPs) from VLIW code. The CVLIW processor 
individually schedules each instruction within long instructions using functional unit and 
dynamic scheduler pairs. Every dynamic scheduler in the CVLIW processor individually 
checks for data dependencies and resource collisions while scheduli ... 

Keywords: CVLIW processor, ILP, VLIW, individual instruction scheduling 



18 Regular contributions: DSP architectures: past, present and futures 
Edwin J. Tan, Wendi B. Heinzelman 

June 2003 ACM SIGARCH Computer Architecture News, Volume 31 issue 3
Publisher: ACM Press

Full text available: pdf(1.27 MB) Additional Information: full citation, abstract, references

As far as the future of communication is concerned, we have seen that there is great 
demand for audio and video data to complement text. Digital signal processing (DSP) is 
the science that enables traditionally analog audio and video signals to be processed 
digitally for transmission, storage, reproduction and manipulation. In this paper, we will
explain the various DSP architectures and its silicon implementation. We will also discuss 
the state-of-the art and examine the issues pertaining to pe ... 

19 Embedded systems: applications, solutions and techniques (EMBS): Motion
estimation performance of the TM3270 processor

Jan-Willem van de Waerdt, Gerrit A. Slavenburg, Jean-Paul van Itegem, Stamatis Vassiliadis
March 2005 Proceedings of the 2005 ACM symposium on Applied computing SAC '05
Publisher: ACM Press

Full text available: pdf(123.65 KB) Additional Information: full citation, abstract, references, index terms

Motion estimation constitutes a significant computational part of video standards such as 
MPEG2, MPEG4, and H264/AVC. This paper evaluates the performance of a motion 
estimation algorithm on the TM3270, a low-cost media-processor. In order to improve 
performance, the TM3270 processor provides architectural enhancements over previous 
TriMedia processors. We quantify the speedup of the proposed new operations to motion 
estimation performance. We show that the new operations incorporated in ... 

Keywords: media processor, motion estimation, software implementation 



20 Embedded system architectures: Instruction buffering exploration for low energy 
VLIWs with instruction clusters

Tom Vander Aa, Murali Jayapala, Francisco Barat, Geert Deconinck, Rudy Lauwereins, 
Francky Catthoor, Henk Corporaal 

January 2004 Proceedings of the 2004 conference on Asia South Pacific design
automation: electronic design and solution fair ASP-DAC '04

Publisher: IEEE Press

Full text available: pdf(393.70 KB), Publisher Site Additional Information: full citation, abstract, references

For multimedia applications, loop buffering is an efficient mechanism to reduce the power 
in the instruction memory of embedded processors. In particular, software controlled 
clustered loop buffers are energy efficient. However current compilers for VLIW do not 
fully exploit the potentials offered by such a clustered organization. This paper presents an
algorithm to explore what is the optimal loop buffer configuration and the optimal way to 
use this configuration for an application or a s ... 
1 Code size minimization and retargetable assembly for custom EPIC and VLIW
instruction formats

Shail Aditya, Scott A. Mahlke, B. Ramakrishna Rau

October 2000 ACM Transactions on Design Automation of Electronic Systems
(TODAES), Volume 5 Issue 4
Publisher: ACM Press

Full text available: pdf(568.33 KB) Additional Information: full citation, abstract, references, index terms

PICO is a fully automated system for designing the architecture and the microarchitecture 
of VLIW and EPIC processors. A serious concern with this class of processors, due to their 
very long instructions, is their code size. One focus of this paper is to describe a series of 
code size minimization techniques used within PICO, some of which are applied during the 
automatic design of the instruction format, while others are applied during program 
assembly. The design of a retargetable assembl ... 

Keywords: EPIC, VLIW, code size minimization, custom templates, design automation, 
instruction format design, noop compression, retargetable assembly 



2 Partitioned register files for VLIWs: a preliminary analysis of tradeoffs 
Andrea Capitanio, Nikil Dutt, Alexandru Nicolau 

December 1992 ACM SIGMICRO Newsletter, Proceedings of the 25th annual
international symposium on Microarchitecture MICRO 25, volume 23 issue 1-2

Publisher: IEEE Computer Society Press, ACM Press

Full text available: pdf(1.09 MB) Additional Information: full citation, references, citings, index terms




3 Cluster assignment for high-performance embedded VLIW processors
Viktor S. Lapinskii, Margarida F. Jacome, Gustavo A. De Veciana

July 2002 ACM Transactions on Design Automation of Electronic Systems (TODAES),
Volume 7 Issue 3
Publisher: ACM Press

Full text available: pdf(226.89 KB) Additional Information: full citation, abstract, references, citings, index terms

Clustering is an effective method to increase the available parallelism in VLIW datapaths 
without incurring severe penalties associated with a large number of register file ports. 
Efficient utilization of a clustered datapath requires careful binding/assignment of 
operations to clusters. The article proposes a binding algorithm that effectively explores 
trade-offs between in-cluster operation serialization and delays associated with data 
transfers between clusters. Extensive experimental evidenc ... 
Keywords: Operation binding, clustered VLIW datapaths, embedded processors,
embedded systems, partitioning 



4 Reducing dynamic and leakage energy in VLIW architectures
W. Zhang, Y.-F. Tsai, D. Duarte, N. Vijaykrishnan, M. Kandemir, M. J. Irwin
February 2006 ACM Transactions on Embedded Computing Systems (TECS), volume 5
Issue 1

Publisher: ACM Press 

Full text available: pdf(1.20 MB) Additional Information: full citation, abstract, references, index terms

The mobile computing device market has been growing rapidly. This brings the 
technologies that optimize system energy to the forefront. As circuits continue to scale in 
the future, it would be important to optimize both leakage and dynamic energy. Effective 
optimization of leakage and dynamic energy consumption requires a vertical integration of 
techniques spanning from circuit to software levels. Schedule slacks in codes executing in 
VLIW architectures present an opportunity for such an integra ... 

Keywords: VLIW architecture, compiler, dynamic energy, leakage energy, schedule 
slacks 



5 Evaluation of scheduling techniques on a SPARC-based VLIW testbed
Seongbae Park, SangMin Shim, Soo-Mook Moon 

December 1997 Proceedings of the 30th annual ACM/IEEE international symposium
on Microarchitecture
Publisher: IEEE Computer Society

Full text available: pdf, Publisher Site Additional Information: full citation, abstract, references, citings, index terms

The performance of Very Long Instruction Word (VLIW) microprocessors depends on the 
close cooperation between the compiler and the architecture. This paper evaluates a set 
of important compilation techniques and related architectural features for VLIW machines. 
The evaluation is performed on a SPARC-based VLIW testbed where gcc-generated 
optimized SPARC code is scheduled into high-performance VLIW code. As a base 
scheduling compiler, we experiment with three core scheduling techniques including ... 

Keywords: SPARC-based VLIW testbed, VLIW microprocessors, Very Long Instruction 
Word microprocessors, all-path speculation, compiler, computer architecture, copies, gcc- 
generated optimized SPARC code, high-performance VLIW code, loop unrolling, memory 
disambiguation, nongreedy enhanced pipeline scheduling, nonspeculative operations, 
parallel machines, performance, profile-based all-path speculation, renaming, restricted 
speculative loads, scheduling compiler, scheduling techniques, software pipelining, 
speculative operations, trace-based speculation 



6 Dynamically scheduled VLIW processors
B. Ramakrishna Rau 

December 1993 Proceedings of the 26th annual international symposium on 
Microarchitecture 

Publisher: IEEE Computer Society Press 

Full text available: pdf(1.64 MB) Additional Information: full citation, references, citings



Keywords: VLIW processors, dynamic scheduling, multiple operation issue, out-of-order 
execution, scoreboarding 
configurable VLIW processors
Yuki Kobayashi, Shinsuke Kobayashi, Koji Okuda, Keishi Sakanushi, Yoshinori Takeuchi, 
Masaharu Imai 

January 2004 Proceedings of the 2004 conference on Asia South Pacific design
automation: electronic design and solution fair ASP-DAC '04

Publisher: IEEE Press

Full text available: pdf(217.47 KB), Publisher Site Additional Information: full citation, abstract, references

This paper proposes a synthesizable HDL code generation method using a processor 
specification description. The proposed approach can change the number of slots and 
pipeline stages, and dispatching rule to assign operations to resources. In addition, 
designers can specify each instruction behavior using the specification language. A control 
logic, a decode logic, and a data path of VLIW processor are generated from the processor 
specification. Designers can explore ASIP design space using the pr ... 

8 An efficient resource-constrained global scheduling technique for superscalar and
VLIW processors

Soo-Mook Moon, Kemal Ebcioglu

December 1992 ACM SIGMICRO Newsletter, Proceedings of the 25th annual
international symposium on Microarchitecture MICRO 25, volume 23 issue

Publisher: IEEE Computer Society Press, ACM Press

Keywords: VLIW, compile-time parallelization, instruction-level parallelism, superscalar



9 Parallelizing nonnumerical code with selective scheduling and software pipelining 
Soo-Mook Moon, Kemal Ebcioglu 

November 1997 ACM Transactions on Programming Languages and Systems
(TOPLAS), Volume 19 Issue 6
Publisher: ACM Press

Full text available: pdf(543.93 KB) Additional Information: full citation, abstract, references, citings, index terms

Instruction-level parallelism (ILP) in nonnumerical code is regarded as scarce and hard to 
exploit due to its irregularity. In this article, we introduce a new code-scheduling 
technique for irregular ILP called "selective scheduling" which can be used as a 
component for superscalar and VLIW compilers. Selective scheduling can compute a wide 
set of independent operations across all execution paths based on renaming and forward- 
substitution and can compute availab ... 

Keywords: VLIW, global instruction scheduling, instruction-level parallelism, software 
pipelining, speculative code motion, superscalar 



10 Compiler-Directed ILP Extraction for Clustered VLIW/EPIC Machines: Predication, 
Speculation and Modulo Scheduling 
Satish Pillai, Margarida F. Jacome 
Compiler-directed ILP extraction techniques are critical to effectively exploiting the
significant processing capacity of contemporaneous VLIW/EPIC machines. In this paper
we propose a novel algorithm for ILP extraction targeting clustered EPIC machines that 
integrates three powerful techniques: predication, speculation and modulo scheduling. In 
addition, our framework schedules and binds operations, generating actual VLIW code. To 
the best of our knowledge, there is no other algorithm in the li ... 

11 A survey of processors with explicit multithreading
Theo Ungerer, Borut Robic, Jurij Silc

March 2003 ACM Computing Surveys (CSUR), volume 35 issue 1
Publisher: ACM Press

Full text available: pdf(920.16 KB) Additional Information: full citation, abstract, references, citings, index terms

Hardware multithreading is becoming a generally applied technique in the next generation
of microprocessors. Several multithreaded processors are announced by industry or 
already into production in the areas of high-performance microprocessors, media, and 
network processors. A multithreaded processor is able to pursue two or more threads of 
control in parallel within the processor pipeline. The contexts of two or more threads of 
control are often stored in separate on-chip register sets. Unused i ... 

Keywords: Blocked multithreading, interleaved multithreading, simultaneous 
multithreading 

12 Instantaneous current modeling in a complex VLIW processor core
Radu Muresan, Catherine Gebotys

May 2005 ACM Transactions on Embedded Computing Systems (TECS), Volume 4 issue 2
Publisher: ACM Press

Full text available: pdf(3.64 MB) Additional Information: full citation, abstract, references, index terms

Measuring and modeling instantaneous current consumption or current dynamics of a 
processor is important in embedded system designs, wireless communications, low- 
energy mobile computing, security of communications, and reliability. In this paper, we 
introduce a new instruction-level based macromodeling approach for instantaneous 
current consumption in a complex processor core along with new instantaneous current 
measurement techniques at the instruction and program level. Current consumption 
and ... 

Keywords: Instruction-level current model, current and power measurement in a 
processor, instantaneous current model, power and energy model 



13 Enhanced modulo scheduling for loops with conditional branches 
Nancy J. Warter, Grant E. Haab, Krishna Subramanian, John W. Bockhaus 
December 1992 ACM SIGMICRO Newsletter, Proceedings of the 25th annual
international symposium on Microarchitecture MICRO 25, volume 23 issue 1-2

Publisher: IEEE Computer Society Press, ACM Press

Full text available: pdf(1.21 MB) Additional Information: full citation, references, citings, index terms




14 Regular contributions: DSP architectures: past, present and futures
Edwin J. Tan, Wendi B. Heinzelman

June 2003 ACM SIGARCH Computer Architecture News, Volume 31 issue 3

Publisher: ACM Press

Full text available: pdf(1.27 MB) Additional Information: full citation, abstract, references

As far as the future of communication is concerned, we have seen that there is great 
demand for audio and video data to complement text. Digital signal processing (DSP) is 
the science that enables traditionally analog audio and video signals to be processed 
digitally for transmission, storage, reproduction and manipulation. In this paper, we will
explain the various DSP architectures and its silicon implementation. We will also discuss 
the state-of-the art and examine the issues pertaining to pe ... 

15 Parallel and distributed systems and networking: Performance evaluation for a
compressed-VLIW processor

Sunghyun Jee, Kannappan Palaniappan

March 2002 Proceedings of the 2002 ACM symposium on Applied computing

Publisher: ACM Press

Full text available: pdf(498.01 KB) Additional Information: full citation, abstract, references, index terms

This paper presents a new ILP processor architecture called Compressed VLIW (CVLIW).
The CVLIW processor constructs a sequence of long instructions by removing nearly all 
NOPs (No Operations) and LNOPs (Long NOPs) from VLIW code. The CVLIW processor 
individually schedules each instruction within long instructions using functional unit and 
dynamic scheduler pairs. Every dynamic scheduler in the CVLIW processor individually 
checks for data dependencies and resource collisions while scheduli ... 

Keywords: CVLIW processor, ILP, VLIW, individual instruction scheduling 




16 Reconfigurable system: A run-time word-level reconfigurable coarse-grain functional
unit for a VLIW processor
Natalino G. Busa, Carles Rodoreda Sala

October 2002 Proceedings of the 15th international symposium on System Synthesis 

Publisher: ACM Press 

Full text available: pdf(492.13 KB) Additional Information: full citation , abstract , references , index terms

Nowadays, new DSP applications are offering combined and flexible multimedia and 
telecom services. VLIW processor architectures, which include dedicated but inflexible 
functional units, are usually tuned to a single specific application. In order to accelerate a 
wide range of applications, we propose a VLIW processor containing a novel run-time 
reconfigurable functional unit (RC-FU). Only a few hundred bits and few cycles are 
necessary to configure a new coarse-grain operation on the RC-FU unit. ... 

Keywords: VLIW processors, architectural synthesis, reconfigurable logic 
In this paper we present a set of isomorphic control transformations that allow the
compiler to apply local scheduling techniques to acyclic subgraphs of the control flow 
graph. Thus, the code motion complexities of global scheduling are eliminated. This 
approach relies on a new technique, Reverse If-Conversion (RIC), that transforms 
scheduled If-Converted code back to the control flow graph representation. This paper 
presents the predicate internal representation, the algorithms for RIC, a ... 
pipelining algorithm for clustered embedded VLIW processors
Publisher: IEEE Press
In this paper we describe a software pipelining framework, CALiBeR (Cluster Aware Load
Balancing Retiming Algorithm), suitable for compilers targeting clustered embedded VLIW 
processors. CALiBeR can be effectively used by embedded system designers to explore 
different code optimization alternatives, i.e., can assist the generation of high-quality 
customized retiming solutions for desired program memory size and throughput 
requirements, while minimizing register pressure. An extensive set of expe ... 
1 Low Power: Power-aware branch prediction techniques: a compiler-hints based
approach for VLIW processors

M. Monchiero, G. Palermo, M. Sami, C. Silvano, V. Zaccaria, R. Zafalon

April 2004 Proceedings of the 14th ACM Great Lakes symposium on VLSI 
Publisher: ACM Press 

Full text available: pdf(151.69 KB) Additional Information: full citation , abstract , references , index terms

Main goal of the paper is introducing a dynamic branch prediction scheme suitable for 
energy-aware VLIW (Very Long Instruction Word) processors. The proposed technique is 
based on a compiler hint mechanism to filter the accesses to the branch predictor blocks. 
Experimental results have been carried out on Lx/ST200, an industrial 4-issue VLIW 
architecture. We gathered two sets of results: First, by introducing the proposed low- 
power branch prediction technique in the Lx processor, which fe ... 

Keywords: VLIW processors, branch prediction, low-power design 



Embedded systems: A VLIW low power Java processor for embedded applications
Antonio Carlos S. Beck, Luigi Carro

September 2004 Proceedings of the 17th symposium on Integrated circuits and
system design
Publisher: ACM Press

Full text available: pdf(303.80 KB) Additional Information: full citation , abstract , references , index terms

This paper presents a pioneer VLIW architecture of a native Java processor. We show 
that, thanks to the specific stack architecture and to the use of the VLIW technique, one is 
able to obtain a meaningful reduction of power dissipation, with small area overhead, 
when compared to other ways of executing Java in hardware. The underlying technique is 
based on the reuse of memory access instructions, hence reducing power during memory 
or cache accesses. The architecture is validated for some complex ... 



Keywords: Java, VLIW, power consumption 
Low power: Branch prediction techniques for low-power VLIW processors
G. Palermo, M. Sami, C. Silvano, V. Zaccaria, R. Zafalon

April 2003 Proceedings of the 13th ACM Great Lakes symposium on VLSI 
Publisher: ACM Press 

Full text available: pdf(178.89 KB) Additional Information: full citation , abstract , references , citings , index terms

Main goal of the paper is to introduce a branch prediction scheme suitable for energy- 
efficient VLIW (Very Long Instruction Word) processors aiming at reducing the energy 
associated with the prediction phase by filtering the accesses to the branch predictor 
block. To analyze the effectiveness of the proposed low-power branch prediction scheme, 
we combined it to some well-known dynamic branch prediction techniques suitable for 
VLIW processors. Experimental results have been carried out on Lx, a 4 ... 

Keywords: VLIW processors, branch prediction, low-power design 



5 Exploiting data forwarding to reduce the power budget of VLIW embedded
processors 

M. Sami, D. Sciuto, C. Silvano, V. Zaccaria, R. Zafalon 

March 2001 Proceedings of the conference on Design, automation and test in Europe 
Publisher: IEEE Press 
Keywords: VLIW embedded architectures, forwarding, low-power, pipeline processors



New design techniques for application specific processors: A loop accelerator for low 
power embedded VLIW processors 
Binu Mathew, Al Davis 

September 2004 Proceedings of the 2nd IEEE/ACM/IFIP international conference on
Hardware/software codesign and system synthesis
Publisher: ACM Press

Full text available: pdf(221.21 KB) Additional Information: full citation , abstract , references , citings , index terms

The high transistor density afforded by modern VLSI processes have enabled the design 
of embedded processors that use clustered execution units to deliver high levels of 
performance. However, delivering data to the execution resources in a timely manner 
remains a major problem that limits ILP. It is particularly significant for embedded 
systems where memory and power budgets are limited. A distributed address generation 
and loop acceleration architecture for VLIW processors is presented. This de ... 

Keywords: VLIW, embedded systems, low power design 
quadruple 8-way VLIW processors with interface timing analysis considering power
supply noise

Satoshi Imai, Atsuki Inoue, Motoaki Matsumura, Kenichi Kawasaki, Atsuhiro Suga 
January 2006 Proceedings of the 2006 conference on Asia South Pacific design
automation ASP-DAC '06
Publisher: ACM Press

Full text available: pdf(479.55 KB) Additional Information: full citation , abstract , references

This paper introduces a 51.2Gops, 1.0GB/s-DMA single-chip multi-processor integrating
quadruple cores and proposes a new power integrity analysis. Our multi-processor is
designed to decode MP@HL streams without any dedicated circuits. To achieve such high
performance, data throughput as well as processing capability is important, requiring a 
large number of high speed I/Os. However, this makes for a high level of power supply 
noise. We then applied an interface timing margin analysis tool that t ... 

Power minimization derived from architectural-usage of VLIW processors 
C. Gebotys, R. Gebotys, S. Wiratunga 

June 2000 Proceedings of the 37th conference on Design automation 
Publisher: ACM Press 

Full text available: pdf(443.06 KB) Additional Information: full citation , abstract , references , citings , index terms

This paper presents an empirical approach to inferring low power code generation 
techniques for VLIW processors. Architectural usage variables are used to generate 
equations for power prediction which are in turn used to infer new code generation 
techniques for low power. Unlike previous techniques, the methodology empirically 
derives a power prediction equation and then based upon the coefficients of the 
architectural-usage variables identifies new VLIW code generation techniques f ...



9 Session 10B: VLIW exploration and design synthesis: Power exploration for
embedded VLIW architectures 

Mariagiovanna Sami, Donatella Sciuto, Cristina Silvano, Vittorio Zaccaria 
November 2000 Proceedings of the 2000 IEEE/ACM international conference on
Computer-aided design ICCAD '00
Publisher: IEEE Press

Full text available: pdf(280.04 KB) Additional Information: full citation , abstract , references , citings

In this paper, we propose a system-level power exploration methodology for embedded 
VLIW architectures based on an instruction-level analysis. The instruction-level energy 
model targets a general pipeline scalar processor; several architectural parameters such 
as number and type of pipeline stages as well as average stall/latency cycles per 
instruction and inter-instruction effects are taken into account. The application of the 
proposed model to VLIW processors results intractable from the point ... 

10 Design space exploration for embedded systems: Energy estimation and optimization
of embedded VLIW processors based on instruction clustering 
A. Bona, M. Sami, D. Sciuto, V. Zaccaria, C. Silvano, R. Zafalon 
June 2002 Proceedings of the 39th conference on Design automation 

Publisher: ACM Press 

Full text available: pdfaos.OO KB) Additional Information: full citation , abstract , references , citings , index terms

Aim of this paper is to propose a methodology for the definition of an instruction-level 
energy estimation framework for VLIW (Very Long Instruction Word) processors. The 
power modeling methodology is the key issue to define an effective energy-aware 
software optimisation strategy for state-of-the-art ILP (Instruction Level Parallelism) 
processors. The methodology is based on an energy model for VLIW processors that 
exploits instruction clustering to achieve an efficient and fine grained energy ... 
Embedded wireless devices require secure high-performance cryptography in addition to
low-cost and low-energy dissipation. This paper presents for the first time a design 
methodology for security on a VLIW complex DSP-embedded processor core. Elliptic curve 
cryptography is used to demonstrate the design for security methodology. Results are 
verified with real dynamic power measurements and show that compared to previous 
research a 79% improvement in performance is achieved. Modification o ...
12 Application specific processors: A low power architecture for embedded perception

Binu Mathew, Al Davis, Mike Parker
September 2004 Proceedings of the 2004 international conference on Compilers,
architecture, and synthesis for embedded systems
Publisher: ACM Press 

Full text available: pdf(310.49 KB) Additional Information: full citation , abstract , references , index terms 

Recognizing speech, gestures, and visual features are important interface capabilities for 
future embedded mobile systems. Unfortunately, the real-time performance requirements 
of complex perception applications cannot be met by current embedded processors and 
often even exceed the performance of high performance microprocessors whose energy 
consumption far exceeds embedded energy budgets. Though custom ASICs provide a 
solution to this problem, they incur expensive and lengthy design cycles and ... 

Keywords: VLIW, computer vision, embedded systems, low power design, perception, 
speech recognition, stream processor 



13 Synthesis for Low Power: Current consumption dynamics at instruction and program
level for a VLIW DSP processor
Radu Muresan, Catherine H. Gebotys

September 2001 Proceedings of the 14th international symposium on Systems 
synthesis 

Publisher: ACM Press 

Full text available: pdf(707.88 KB) Additional Information: full citation , abstract , references , citings , index terms

This paper describes a new methodology for analyzing low-level current dynamics at the
instruction level and the program level for a VLIW DSP processor core. An efficient 
methodology for software power analysis is presented which unlike other research 
supports dynamic current analysis and complex VLIW processor cores. Analysis of high 
bank register allocation, equivalent functional construct usage, and program-based 
current, power, and energy is presented. The basic principles and methods develo ... 

Keywords: DSP processors, current dynamics, methodology 



14 Low power design for embedded and real-time systems: Instruction scheduling of 
VLIW architectures for balanced power consumption
Shu Xiao, Edmund M-K. Lai

January 2005 Proceedings of the 2005 conference on Asia South Pacific design 
automation ASP-DAC '05 

Publisher: ACM Press 

Full text available: pdf(408.47 KB) Additional Information: full citation , abstract , references

An instruction word in VLIW (very long instruction word) processors consists of a variable 
number of individual instructions. Therefore the power consumption variation over time 
significantly depends on the parallel instruction schedule generated by the compiler. 
Sharp power variations across time cause power supply noises, degrade chip reliability 
and accelerate battery exhaustion. This paper proposes a branch and bound algorithm for 
instruction scheduling of VLIW architectures that effectively ... 
15 Instruction-level power estimation for embedded VLIW cores
M. Sami, D. Sciuto, C. Silvano, V. Zaccaria 

May 2000 Proceedings of the eighth international workshop on Hardware/software 

codesign 
Publisher: ACM Press 

Full text available: pdf(226.00 KB) Additional Information: full citation , abstract , references , citings , index terms

In this paper, a power estimation methodology operating at the instruction-level is 
proposed. The methodology is tightly related to the characteristics of the system 
architecture, mainly in terms of one or more target processors, the memory sub-system, 
the system-level buses and the coprocessors. In this system-level framework, our main 
goal is to define a power model for CPU cores at the instruction-level. First, the proposed 
power model deals with a general five-stage pipeline processor a ... 

16 DSP: A resource-shared VLIW processor architecture for area-efficient on-chip 
multiprocessing

Kazutoshi Kobayashi, Masao Aramoto, Yoichi Yuyama, Akihiko Higuchi, Hidetoshi Onodera
January 2005 Proceedings of the 2005 conference on Asia South Pacific design
automation ASP-DAC '05
Publisher: ACM Press 

Full text available: pdf(412.71 KB) Additional Information: full citation , abstract , references

We propose an area-efficient resource-shared VLIW processor (RSVP) for future leaky nm 
process technologies. It consists of several single-way independent processor units (IPUs) 
that share parallel processor resources. Each IPU works as a variable-way VLIW processor 
sharing the parallel resources according to priorities of given tasks. RSVP allocates shared 
parallel resources to the IPUs cycle by cycle. It can minimize the number of NOPs that 
waste power. The performance per power (P 3 ... 

17 Instantaneous current modeling in a complex VLIW processor core 
Radu Muresan, Catherine Gebotys 

May 2005 ACM Transactions on Embedded Computing Systems (TECS), volume 4 issue 2 
Publisher: ACM Press 

Full text available: pdf(3.64 MB) Additional Information: full citation , abstract , references , index terms

Measuring and modeling instantaneous current consumption or current dynamics of a 
processor is important in embedded system designs, wireless communications, low- 
energy mobile computing, security of communications, and reliability. In this paper, we 
introduce a new instruction-level based macromodeling approach for instantaneous 
current consumption in a complex processor core along with new instantaneous current 
measurement techniques at the instruction and program level. Current consumption 
and ... 

Keywords: Instruction-level current model, current and power measurement in a 
processor, instantaneous current model, power and energy model 



18 Low power processors: A hamming distance based VLIW/EPIC code compression 
September 2004 Proceedings of the 2004 international conference on Compilers,
architecture, and synthesis for embedded systems 

Publisher: ACM Press 

Full text available: pdf(132.73 KB) Additional Information: full citation , abstract , references , index terms

This paper presents and reports on a VLIW code compression technique based on vector 
Hamming distances. It investigates the appropriate selection of dictionary vectors such 
that all program vectors are at most a specified maximum Hamming distance from a 
dictionary vector. Bit toggling information is used to restore the original vector. A
dictionary vector selection method which considered both vector frequency as well as
maximum coverage achieved better results than just considering vector freque ... 

Keywords: VLIW, code compression, hamming distance 
Publisher: IEEE Computer Society Press

Multimedia processing on embedded devices requires an architecture that leads to high
performance, low power consumption, reduced design complexity, and small code size. In 
this paper, we use EEMBC, an industrial benchmark suite, to compare the VIRAM vector 
architecture to superscalar and VLIW processors for embedded multimedia applications. 
The comparison covers the VIRAM instruction set, vectorizing compiler, and the prototype 
chip that integrates a vector processor with DRAM main memory. We de ... 

20 Reconfigurable system: A run-time word-level reconfigurable coarse-grain functional
unit for a VLIW processor 
Natalino G. Busa, Carles Rodoreda Sala 

October 2002 Proceedings of the 15th international symposium on System Synthesis 
Publisher: ACM Press 

Full text available: pdf(492.13 KB) Additional Information: full citation , abstract , references , index terms

Nowadays, new DSP applications are offering combined and flexible multimedia and 
telecom services. VLIW processor architectures, which include dedicated but inflexible 
functional units, are usually tuned to a single specific application. In order to accelerate a 
wide range of applications, we propose a VLIW processor containing a novel run-time 
reconfigurable functional unit (RC-FU). Only a few hundred bits and few cycles are 
necessary to configure a new coarse-grain operation on the RC-FU unit. ... 

Keywords: VLIW processors, architectural synthesis, reconfigurable logic 
In embedded system design, memory has been one of the most restricted resources.
Reducing program size has been an important goal when designing an embedded system. 
Most of the previous work on code compression has targeted RISC architectures. Recently 
VLIW processors became very popular, particularly for signal processing. Decompression 
speed is especially important for VLIW architectures given that the length of the 
instruction word is long. Furthermore, modern VLIW architectures use flexible ... 

Compiler-driven cached code compression schemes for embedded ILP processors 
Sergei Y. Larin, Thomas M. Conte 

November 1999 Proceedings of the 32nd annual ACM/IEEE international symposium 
on Microarchitecture 

Publisher: IEEE Computer Society 
During the last 15 years, embedded systems have grown in complexity and performance
to rival desktop systems. The architectures of these systems present unique challenges to 
processor microarchitecture, including instruction encoding and instruction fetch 
processes. This paper presents new techniques for reducing embedded system code size 
without reducing functionality. This approach is to extract the pipeline decoder logic for an 
embedded VLIW processor in software at system develo ... 



Code optimization - 1: Local scheduling techniques for memory coherence in a 
clustered VLIW processor with a distributed data cache 
Enric Gibert, Jesus Sanchez, Antonio Gonzalez 

March 2003 Proceedings of the international symposium on Code generation and 
optimization: feedback-directed and runtime optimization CGO '03 

Publisher: IEEE Computer Society 

Full text available: pdf(1.19 MB) Additional Information: full citation , abstract , references , index terms

Clustering is a common technique to deal with wire delays. Fully-distributed architectures, 
where the register file, the functional units and the cache memory are partitioned, are 
particularly effective to deal with these constraints and besides they are very scalable. 
However, the distribution of the data cache introduces a new problem: memory 
instructions may reach the cache in an order different to the sequential program order,
thus possibly violating its contents. In this paper two local sch ... 

Compiler code transformations for superscalar-based high performance systems 
S. A. Mahlke, W. Y. Chen, J. C. Gyllenhaal, W.-M. W. Hwu 

December 1992 Proceedings of the 1992 ACM/IEEE conference on Supercomputing 
Publisher: IEEE Computer Society Press 

Full text available: pdf(1.05 MB) Additional Information: full citation , references , citings , index terms



5 Embedded systems: A VLIW low power Java processor for embedded applications
Antonio Carlos S. Beck, Luigi Carro

September 2004 Proceedings of the 17th symposium on Integrated circuits and
system design

Publisher: ACM Press 

Full text available: pdf(303.80 KB) Additional Information: full citation , abstract , references , index terms

This paper presents a pioneer VLIW architecture of a native Java processor. We show 
that, thanks to the specific stack architecture and to the use of the VLIW technique, one is 
able to obtain a meaningful reduction of power dissipation, with small area overhead, 
when compared to other ways of executing Java in hardware. The underlying technique is 
based on the reuse of memory access instructions, hence reducing power during memory 
or cache accesses. The architecture is validated for some complex ... 

Keywords: Java, VLIW, power consumption 



Compiler scheduling: Effective instruction scheduling techniques for an interleaved
cache clustered VLIW processor
Enric Gibert, Jesus Sanchez, Antonio Gonzalez

November 2002 Proceedings of the 35th annual ACM/IEEE international symposium
on Microarchitecture
Publisher: IEEE Computer Society Press

Clustering is a common technique to overcome the wire delay problem incurred by the
evolution of technology. Fully-distributed architectures, where the register file, the 
functional units and the data cache are partitioned, are particularly effective to deal with 
these constraints and besides they are very scalable. In this paper effective instruction 
scheduling techniques for a clustered VLIW processor with a word-interleaved cache are 
proposed. Such scheduling techniques rely on: (i) loop unro ... 

Regular contributions: DSP architectures: past, present and futures
Edwin J. Tan, Wendi B. Heinzelman 

June 2003 ACM SIGARCH Computer Architecture News, volume 31 issue 3 
Publisher: ACM Press 

Full text available: pdf(1.27 MB) Additional Information: full citation , abstract , references

As far as the future of communication is concerned, we have seen that there is great 
demand for audio and video data to complement text. Digital signal processing (DSP) is 
the science that enables traditionally analog audio and video signals to be processed 
digitally for transmission, storage, reproduction and manipulation. In this paper, we will 
explain the various DSP architectures and its silicon implementation. We will also discuss 
the state-of-the art and examine the issues pertaining to pe ... 

8 Microservers: a new memory semantics for massively parallel computing
Jay B. Brockman, Peter M. Kogge, Thomas L Sterling, Vincent W. Freeh, Shannon K. Kuntz 
May 1999 Proceedings of the 13th international conference on Supercomputing 
On the use of registers vs. cache to minimize memory traffic
J. R. Goodman, W. C. Hsu 

June 1986 ACM SIGARCH Computer Architecture News , Proceedings of the 13th
annual international symposium on Computer architecture ISCA '86, volume
14 Issue 2

Publisher: IEEE Computer Society Press, ACM Press 

Single-chip computers are becoming increasingly limited by the access constraints to off-
chip memory. To achieve high performance, the structure of on-chip memory must be 
appropriate, and it must be allocated effectively to minimize off-chip communication. We 
report experiments that demonstrate that on-chip memory can be effective for local 
variable accesses. For best use of the limited on-chip area, we suggest organizing 
memory as registers and argue that an effective register spilling sch ... 

10 Code compression: Reducing code size for heterogeneous-connectivity-based VLIW 
DSPs through synthesis of instruction set extensions 
Partha Biswas, Nikil Dutt 

October 2003 Proceedings of the 2003 international conference on Compilers, 
architecture and synthesis for embedded systems 

Publisher: ACM Press 

Full text available: pdf(176.82 KB) Additional Information: full citation , abstract , references , index terms

VLIW DSP architectures exhibit heterogeneous connections between functional units and 
register files for speeding up special tasks. Such architectural characteristics can be 
effectively exploited through the use of complex instruction set extensions (ISEs). 
Although VLIWs are increasingly being used for DSP applications to achieve very high 
performance, such architectures are known to suffer from increased code size. This paper 
addresses how to generate ISEs that can result in significant code s ... 

Keywords: dependence conflict graph, heterogeneous-connectivity-based DSP, 
instruction set architecture, instruction set extensions, restricted data dependence graph, 
static single assignment 



11 Power minimization derived from architectural-usage of VLIW processors 
C. Gebotys, R. Gebotys, S. Wiratunga

June 2000 Proceedings of the 37th conference on Design automation
Publisher: ACM Press 

This paper presents an empirical approach to inferring low power code generation 
techniques for VLIW processors. Architectural usage variables are used to generate 
equations for power prediction which are in turn used to infer new code generation 
techniques for low power. Unlike previous techniques, the methodology empirically 
derives a power prediction equation and then based upon the coefficients of the 
architectural-usage variables identifies new VLIW code generation techniques f ... 

12 Lx: a technology platform for customizable VLIW embedded processing 

Paolo Faraboschi, Geoffrey Brown, Joseph A. Fisher, Giuseppe Desoli, Fred Homewood 
May 2000 ACM SIGARCH Computer Architecture News , Proceedings of the 27th
annual international symposium on Computer architecture ISCA '00, volume
28 Issue 2
Publisher: ACM Press

Lx is a scalable and customizable VLIW processor technology platform designed by
Hewlett-Packard and STMicroelectronics that allows variations in instruction issue width, 
the number and capabilities of structures and the processor instruction set. For Lx we 
developed the architecture and software from the beginning to support both scalability 
(variable numbers of identical processing resources) and customizability (special purpose 
resources). In this paper we consider the followi ... 

13 A VLIW architecture for a trace scheduling compiler 
Robert P. Colwell, Robert P. Nix, John J. O'Donnell, David B. Papworth, Paul K. Rodman
October 1987 ACM SIGARCH Computer Architecture News , ACM SIGPLAN Notices , 

ACM SIGOPS Operating Systems Review , Proceedings of the second 
international conference on Architectual support for programming 
languages and operating systems ASPLOS-II, volume 15 , 22 , 21 issue 5 , 10 , 4 
Publisher: IEEE Computer Society Press, ACM Press 

Full text available: pdf(1.59 MB) Additional Information: full citation , abstract , references , citings , index terms

Very Long Instruction Word (VLIW) architectures were promised to deliver far more than 
the factor of two or three that current architectures achieve from overlapped execution. 
Using a new type of compiler which compacts ordinary sequential code into long 
instruction words, a VLIW machine was expected to provide from ten to thirty times the 
performance of a more conventional machine built of the same implementation 
technology. Multiflow Computer, Inc., has now built a VLIW called the TRACE™ ...

14 Low power issues: Compiler-directed thermal management for VLIW functional units 
Madhu Mutyam, Feihui Li, Vijaykrishnan Narayanan, Mahmut Kandemir, Mary Jane Irwin 
June 2006 Proceedings of the 2006 ACM SIGPLAN/SIGBED conference on Language,
compilers and tool support for embedded systems LCTES '06
Publisher: ACM Press 

Full text available: pdf(599.24 KB) Additional Information: full citation , abstract , references , index terms

As processors, memories, and other components of today's embedded systems are 
pushed to higher performance in more enclosed spaces, processor thermal management 
is quickly becoming a limiting design factor. While previous proposals mostly approached 
this thermal management problem from circuit and architecture angles, software can also 
play an important role in identifying and eliminating thermal hotspots as it is the main 
factor that shapes the order and frequency of accesses to differen ... 

Keywords: IPC, VLIW, thermal 



15 Dynamic rescheduling: a technique for object code compatibility in VLIW 
architectures 

Thomas M. Conte, Sumedh W. Sathaye 

December 1995 Proceedings of the 28th annual international symposium on
Microarchitecture
Publisher: IEEE Computer Society Press 
16 Application specific architecture design tools: Performance simulation modeling for
fast evaluation of pipelined scalar processor by evaluation reuse
Ho Young Kim, Tag Gon Kim
June 2005 Proceedings of the 42nd annual conference on Design automation

Publisher: ACM Press 

Full text available: pdf(834.72 KB) Additional Information: full citation , abstract , references , index terms

This paper proposes a rapid and accurate evaluation scheme for cycle counts of a 
pipelined processor using evaluation reuse technique. Since exploration of an optimal 
processor is a time-consuming task due to large design space, fast evaluation 
methodology for an architecture is crucial. We introduce the performance simulation
model which can evaluate the performance without considering the functional correctness. 
This model has an FSM-like form and can afford to take all hazard types of pipelin ... 

Keywords: compiled simulation, evaluation reuse, instruction set architecture, 
retargetable simulation, trace-driven simulation 



17 Processor-based system: A Trimaran based framework for exploring the design
space of VLIW ASIPs with coarse grain functional units
Bhuvan Middha, Anup Gangwar, Anshul Kumar, M. Balakrishnan, Paolo Ienne

October 2002 Proceedings of the 15th international symposium on System Synthesis 

Publisher: ACM Press 

Full text available: pdf Additional Information: full citation , abstract , references , citings , index terms

It is widely accepted that use of an Application Specific Instruction Set Processor (ASIP) in
an embedded system can provide a solution which is much more flexible than ASICs and 
much more efficient than standard processors in terms of performance and power 
consumption. However a lack of an acceptable design methodology and supporting tools 
for ASIPs limits their use even today. We present in this paper a methodology for design 
space exploration of high performance VLIW ASIPs by modeling Applica ... 

Keywords: ASIP, Trimaran, VLIW, design space exploration, performance 



18 Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors
Enric Gibert, Jesus Sanchez, Antonio Gonzalez 

December 2003 Proceedings of the 36th annual IEEE/ACM International Symposium 
on Microarchitecture 

Publisher: IEEE Computer Society 

Wire delays are a major concern for current and forthcoming processors. One approach to
attack this problem is to divide the processor into semi-independent units referred to as
clusters. A cluster usually consists of a local register file and a subset of the functional
units, while the data cache remains centralized. However, as technology evolves, the
latency of such a centralized cache will increase leading to an important performance
impact. In this paper we propose to include flexible low-latency ...

19 Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors
Francisco Barat, Murali Jayapala, Pieter Op de Beeck, Geert Deconinck, K. U. Leuven

January 2002 Proceedings of the 2002 conference on Asia South Pacific design
automation/VLSI Design
Publisher: IEEE Computer Society 

Full text available: pdf(191.97 KB)

This paper shows that software pipelining can be an effective technique for code
generation for coarse-grained reconfigurable instruction set processors. The paper 
describes a technique, based on adding an operation assignment phase to software 
pipelining, that performs reconfigurable instruction generation and instruction scheduling 
on a combined algorithm. Although typical compilers for reconfigurable processors 
perform these steps separately, results show that the combination enables a succes ... 

Keywords: reconfigurable processor, code generation, coarse grained logic, software
pipelining, vliw, spatial computation



20 Computer architecture: A unified processor architecture for RISC & VLIW DSP

Tay-Jyi Lin, Chie-Min Chao, Chia-Hsien Liu, Pi-Chen Hsiao, Shin-Kai Chen, Li-Chun Lin,
Chih-Wei Liu, Chein-Wei Jen

April 2005 Proceedings of the 15th ACM Great Lakes symposium on VLSI 

Publisher: ACM Press 

This paper presents a unified processor core with two operation modes. The processor
core works as a compiler-friendly MIPS-like core in the RISC mode, and it is a 4-way
VLIW in its DSP mode, which has distributed and ping-pong register organization 
optimized for stream processing. To minimize hardware, the DSP mode has no control 
construct for program flow, while the data manipulation RISC instructions are executed in 
the DSP datapath. Moreover, the two operation modes can be changed ins ... 

Keywords: digital signal processor, dual-core processor, register organization,
variable-length instruction encoding